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ABSTRACT 

Rapid detection of compact binary coalescence (CBC) with a network of advanced 
gravitational-wave detectors will offer a unique opportunity for multi-messenger astronomy. 
Prompt detection alerts for the astronomical community might make it possible to observe the 
onset of electromagnetic emission from CBC. We demonstrate a computationally practical fil- 
tering strategy that could produce early-warning triggers before gravitational radiation from the 
final merger has arrived at the detectors. 

gamma-ray burst: general — gravitational waves — methods: data analysis — meth- 



tem eventually merges. If a neutron star (NS) 
is involved, it might become tidally disrupted 
near the merger and fuel an electromagnetic (EM) 
counterpart (Shibata & Taniguchi 2008). Effort 
from both the GW and the broader astronomical 
communities might make it possible to use GW ob- 
servations as early warning triggers for EM follow- 
up. In the first generation of ground-based laser 
interferometers, the GW community initiated a 
project to send alerts when potential GW tran- 
sients were observed in order to trigger follow-up 
observations by EM telescopes. The typical la- 
tencies were 30 minutes (Huglicy 2011). This was 
an important achievement, but too late to catch 
any prompt optical flash or the onset of an on- 
axis optical afterglow. Since the GW signal is 
in principle detectable even before the tidal dis- 
ruption, one might have the ambition of reporting 
GW candidates not minutes after the merger, but 
seconds before. We explore one essential ingredi- 
ent of this problem, a computationally inexpen- 
sive latency free, real-time filtering algorithm for 
detecting inspiral signals in GW data. We also 
consider the prospects for advanced GW detec- 
tors and discuss other areas of work that would be 
required for rapid analysis. 

Compact binary coalescence (CBC) is a plau- 
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1. Introduction 



As a compact binary system loses energy to 
gravitational waves (GWs), its orbital separation 
decays, leading to a runaway inspiral with the GW 
amplitude and frequency increasing until the sys- 
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sible progenitor for most short gamma-ray bursts 
(short GRBs; Leo ct al. 2005; Nakar 2007), but the 
association is not iron-clad (Virgih ct al. 2011). 
The tidally disrupted material falls onto the newly 
formed, rapidly spinning compact object and is 
accelerated in jets along the spin axis with a 
timescale of 0.1-1 s after the merger (.Janka et al. 
1999), matching the short GRB duration distri- 
bution well. Prompt EM emission including the 
GRB can arise as fast outflowing matter col- 
lides with slower matter ejected earlier in inner 
shocks. The same inner shocks, or potentially re- 
verse shocks, can produce an accompanying opti- 
cal flash (Sari fc Piran 1999). The prompt emis- 
sion is a probe into the extreme initial conditions 
of the outflow, in contrast with afterglows, which 
arise in the external shock with the local medium 
and are relatively insensitive to initial conditions. 
Optical flashes have been observed for a hand- 
ful of long GRBs (Attcia & Boer 2011) by tele- 
scopes with extremely rapid response or, in the 
case of GRB 080319b, by pure serendipity, where 
several telescopes were already observing the af- 
terglow of another GRB in the same field of view 
(FOV; Racusin ct al. 2008). The observed opti- 
cal flashes peaked within tens of seconds and de- 
cayed quickly. For short GRB energy balance and 
plasma density, however, the reverse shock model 
predicts a peak flux in radio, approximately 20 
minutes after the GRB, but also a relatively faint 
optical flash (Nakar 2007); for a once-per-year Ad- 
vanced LIGO event at 130 Mpc, the radio flux will 
peak around 9 GHz at ~5 mJy, with emission in 
the i?-band at ~19 mag. Interestingly, roughly a 
quarter to half of the observed short GRBs also 
exhibit extended X-ray emission of 30-100 s in 
duration beginning ~10 s after the GRB and car- 
rying comparable fluence to the initial outburst. 
This can be explained if the merger results in the 
formation of a proto-magnetar that interacts with 
ejecta (Bucciantini ct al. 2012). Rapid GW alerts 
would enable joint EM and GW observations to 
confirm the short GRB-CBC link and allow the 
early EM observation of exceptionally nearby and 
thus bright events. 

In 2010 October, LIGO^ completed its sixth 
science run (S6) and Virgo^ completed its third 
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science run (VSR3). While both LIGO detec- 
tors and Virgo were operating, several all-sky de- 
tection pipelines operated in a low-latency con- 
figuration to send astronomical alerts, namely. 
Coherent WaveBurst (cWB), Omega, and Multi- 
Band Template Analysis (MBTA; Hughey 2011; 
Abadic et al. 2011a, 2012, 2011c). cWB and 
Omega are both unmodeled searches for bursts 
based on time-frequency decomposition of the 
GW data. MBTA is a novel kind of template- 
based inspiral search that was purpose-built for 
low latency operation. MBTA achieved the best 
GW trigger-generation latencies of 2-5 minutes. 
Alerts were sent with latencies of 30-60 minutes, 
dominated by human vetting. Candidates were 
sent for EM follow-up to several telescopes; Swift, 
LOFAR, ROTSE, TAROT, QUEST, SkyMapper, 
Liverpool Telescope, Pi of the Sky, Zadko, and 
Palomar Transient Factory (Kanner et al. 2008; 
HTighcy 2011) imaged some of the most likely sky 
locations. 

There were a number of sources of latency asso- 
ciated with the search for CBC signals in S6/VSR3 
(Hiighey 2011), listed here. 

Data acquisition and aggregation (>100 ms) 

The LIGO data acquisition system collects data 
from detector subsystems 16 times a second (Bork ct al. 
2001). Data are also copied from all of the GW 
observatories to the analysis clusters over the In- 
ternet, which is capable of high bandwidth but 
only modest latency. Together, these introduce 
a latency of ^100 ms. These technical sources of 
latency could be reduced with significant engineer- 
ing and capital investments, but they are minor 
compared to any of the other sources of latency. 

Data conditioning (~^1 min) Science data 
must be calibrated using the detector's frequency 
response to gravitational radiation. Currently, 
data are calibrated in blocks of 16 s. Within 
~1 minute, data quality is assessed in order to 
create veto flags. These are both technical sources 
of latency that might be addressed with improved 
calibration and data quality software for advanced 
detectors. 

Trigger generation (2—5 min) Low-latency 
data analysis pipelines deployed in S6/VSR3 



achieved an impressive latency of minutes. How- 
ever, second to the human vetting process, this 
dominated the latency of the entire EM follow-up 
process. Even if no other sources of latency ex- 
isted, this trigger generation latency is too long 
to catch prompt or even extended emission. Low- 
latency trigger generation will become more chal- 
lenging with advanced detectors because inspiral 
signals will stay in band up to 10 times longer. In 
this work, we will focus on reducing this source of 
latency. 

Alert generation (2—3 min) S6/VSR3 saw 
the introduction of low-latency astronomical 
alerts, which required gathering event parame- 
ters and sky localization from the various online 
analyses, downselecting the events, and calculat- 
ing telescope pointings. If other sources of latency 
improve, the technical latency associated with this 
infrastructure could dominate, so work should be 
done to improve it. 

Human validation (10 20 min) Because 
the new alert system was commissioned during 
S6/VSR3, all alerts were subjected to quality con- 
trol checks by human operators before they were 
disseminated. This was by far the largest source of 
latency during S6/VSR3. Hopefully, confidence in 
the system will grow to the point where no human 
intervention is necessary before alerts are sent, so 
we give it no further consideration here. 

This work will focus on reducing the latency of 
trigger production. Data analysis strategies for 
advance detection of CBCs will have to strike a 
balance between latency and throughput. CBC 
searches consist of banks of matched filters, or 
cross-correlations between the data stream and a 
bank of nominal "template" signals. There are 
many difi^erent implementations of matched filters, 
but most have high throughput at the cost of high 
latency, or low latency at the cost of low through- 
put. The former are epitomized by the overlap- 
save algorithm for frequency-domain (FD) con- 
volution, currently the preferred method in GW 
searches. The most obvious example of the latter 
is direct time domain (TD) convolution, which is 
latency-free. However, its cost in floating point 
operations per second is linear in the length of the 



templates, so it is prohibitively expensive for long 
templates. The computational challenges of low- 
latency CBC searches are still more daunting for 
advanced detectors for which the inspiral signal re- 
mains in band for a large fraction of an hour (see 
the Appendix). 

Fortunately, the morphology of inspiral signals 
can be exploited to offset some of the compu- 
tational complexity of known low-latency algo- 
rithms. First, the signals evolve slowly in fre- 
quency, so that they can be broken into contiguous 
band-limited time intervals and processed at pos- 
sibly lower sample rates. Second, inspiral filter 
banks consist of highly similar templates, admit- 
ting methods such as the singular value decompo- 
sition (SVD) (Cannon et al. 2010) or the Gram- 
Schmidt process (Field et al. 2011) to reduce the 
number of templates. 

Several efforts that exploit one or both of these 
properties are under way to develop low-latency 
CBC search pipelines with tractable computing re- 
quirements. One example is MBTA (Marion & the Virgo Collaboratio 
2003; Buskulic et al. 2010), which was deployed 
in S6/VSR3. MBTA consists of multiple, usu- 
ally two, template banks for different frequency 
bands, one which is matched to the early inspi- 
ral and the other which is matched to the late 
inspiral. An excursion in the output of any filter 
bank triggers coherent reconstruction of the full 
matched filtered output. Final triggers are built 
from the reconstructed matched filter output. An- 
other novel approach using networks of parallel, 
second-order infinite impulse response (HR) fil- 
ters is being explored by Hooper et al. (2010) and 
Luan et al. (2011). 

We will use both properties to demonstrate that 
a very low latency detection statistic is possible 
with current computing resources. Assuming the 
other technical sources of latency can be reduced 
significantly, this could make it possible to send 
prompt alerts to the astronomical community. 

The paper is organized as follows. First, we dis- 
cuss prospects for early-warning detection. Then, 
we provide an overview of our novel method for 
detecting CBC signals near real-time. We then 
describe a prototype implementation using open 
source signal processing software. To validate our 
approach we present a case study focusing on a 
particular subset of the NS-NS parameter space. 
We conclude with some remarks on what remains 



to prepare for the advanced detector era. 

2. Prospects for early-warning detection 
and EM follow-up 
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Fig. 1. — Expected number of NS-NS sources 
that could be detectable by Advanced LIGO a 
given number of seconds before coalescence. The 
heavy solid line corresponds to the most probable 
yearly rate estimate from Abadic ct al. (201Ub). 
The shaded region represents the 5%-95% confi- 
dence interval arising from substantial uncertainty 
in predicted event rates. 

Before the GW signal leaves the detection band, 
we can imagine examining the signal-to-noise ratio 
(S/N) accumulated up to that point and if it is 
already significant, release an alert immediately, 
trading S/N and sky localization accuracy for pre- 
merger detection. 

In the quadrupole approximation, the instanta- 
neous frequency of the GW inspiral signal is re- 
lated to the time t relative to coalescence (section 
5.1 of Sathyaprakash & Schutz 2009) through 
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where M = M^^^fj."^^^ is the chirp mass of the bi- 
nary, Mt = GM./c^ is the chirp mass in units of 
time, M is the total mass, and /i is the reduced 
mass. The expected value of the single-detector 
S/N for an optimally oriented (source at detec- 
tor's zenith or nadir, orbital plane face-on) inspiral 
source is (Abadic et al. 2010b) 



(2) 




where D is the luminosity distance and S{f) is 
the one-sided power spectral density of the detec- 
tor noise, /low and /high are low- and high- fre- 
quency limits of integration which may be chosen 
to extend across the entire bandwidth of the de- 
tector. If we want to trigger at a time t before 
merger, then we must cut off the SNR integration 
at /high = f{t) with fit) given by Equation (1) 
above. 

Figure 1 shows projected early detectability 
rates for NS-NS binaries in Advanced LIGO as- 
suming the anticipated detector sensitivity for 
the 'zero detuning, high power' configuration de- 
scribed in Shoemaker (2010) and NS-NS merger 
rates estimated in Abadic ct al. (20101)). The 
merger rates have substantial measurement uncer- 
tainty due to the small sample of known double 
pulsar systems that will merge within a Hubble 
time; they also have systematic uncertainty due 
to sensitive dependence on the pulsar luminosity 
distribution function (Kalogcra ct al. 2004). The 
most probable estimates indicate that at a single- 
detector S/N threshold of 8, we will observe a to- 
tal of 40 events yr^^; ~10 yr^^ will be detectable 
within 10 s of merger and ^5 yr~^ will be de- 
tectable within 25 s of merger if analysis can pro- 
ceed with near zero latency. 

We emphasize that any practical GW search 
will include technical delays due to light travel 
time between the detectors, detector infrastruc- 
ture, and the selected data analysis strategy. Fig- 
ure 1 must be understood in the context of all of 
the potential sources of latency, some of which are 
avoidable and some of which are not. 

EM follow-up requires estimating the location 
of the GW source. The localization uncertainty 
can be estimated from the uncertainty in the time 
of arrival of the GWs, which is determined by the 
signal's effective bandwidth and S/N (Fairhurst 
2009). Table 1 and Figure 2 show the estimated 
90% confidence area versus time of the loudest coa- 
lescence events detectable by Advanced LIGO and 
Advanced Virgo. This is the minimum area; local- 
ization is best at high elevation from the plane con- 
taining the detectors and worst at zero elevation. 
Fairhurst also cautions that his Fisher matrix cal- 
culation fails to capture disconnected patches of 
probability, which occur prominently in networks 
of three detectors where there are generally two 
local maxima on opposite sides of the plane of 



Table 1: Horizon Distance, S/N at Merger, and 
Area of 90% Confidence at Selected Times Before 
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Note. — A dash ( ) signifies that the confidence area 

is omitted because at the indicated time the SNR would 
not have crossed the detection threshold of 8. 



the detectors. Aside from the mirror degeneracy, 
characterizing the uncertainty region by the Fisher 
matrix alone tends to overestimate, rather than 
underestimate, the area for low-S/N events, but 
this effect is generally more than compensated by 
the source being in an unfavorable sky location. 
For these reasons, the localization uncertainty es- 
timated from timing is highly optimistic and will 
only suffice for an order-of-magnitudc estimate. 
Once per year, we expect to observe an event with 
a final single-detector S/N of sa27 whose location 
can be constrained to about 1300 deg^ (3.1% of 
the sky) within 25 s of merger, 260 deg^ (0.63% 
of the sky) within 10 s of merger, and 0.82 deg^ 
(0.0020% of the sky) at merger. 

It is unfeasible to search hundreds of square de- 
grees for a prompt counterpart. For comparison to 
some examples of modern ground-based wide-field 
survey instruments, the Palomar Transient Fac- 
tory P48 (Law et al. 2009) has a 3.50 x 2.31 deg^ 
FOV; the Pan-STARRS PI (Kaiser ct al. 2002) 
has a 7 deg^ FOV. Even the eagerly awaited 
LSST will have an FOV of 9.6 deg^ (Ivczic ct al. 
2008). However, it is possible to reduce the lo- 
calization uncertainty by only looking at galax- 
ies from a catalog that lie near the sky location 
and luminosity distance estimate from the GW 
signal (Xuttall & Sutton 2010) as was done in 
S6/VSR3. Within the expected Advanced LIGO 
NS-NS horizon distance, the number of galaxies 
that can produce a given signal amplitude is much 
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Area of the 90% confidence region as 
a function of time before coalescence for sources 
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and 0.1 yr~^. The heavy dot indicates the time 
at which the accumulated S/N exceeds a single- 
detector threshold of 8. 



larger than in Initial LIGO and thus the catalog 
will not be as useful for downselecting pointings 
for most events. However, exceptional GW sources 
will necessarily be extremely nearby. Within this 
reduced volume there will be fewer galaxies to con- 
sider for a given candidate and catalog complete- 
ness will be less of a concern. This should reduce 
the 90% confidence area substantially. 

3. Novel real-time algorithm for CBC de- 
tection 

In this section, we describe a decomposition 
of the CBC signal space that reduces TD fil- 
tering cost sufficiently to allow for the possibil- 
ity of early-warning detection with modest com- 
puting requirements. We expand on the ideas 
of Marion & the Virgo Collaboration (2003) and 
Buskulic et al. (2010) that describe a multi-band 
decomposition of the compact binary signal space 
that resulted in a search with minutes latency 
during S6/VSR3 (Hughcy 2011). We combine 
this with the SVD rank-reduction method of 
Cannon ct al. (2010) that exploits the redundancy 
of the template banks. 

3.1. Conventional CBC searches 

Searches for inspiral signals typically employ 
matched filter banks that discretely sample the 
possible intrinsic parameters (Allen ct al. 2011). 



Suppose that the observed data x[k] consists of 
a known, nominal signal s[k], and additive, zero- 
mean noise n[k] 

x[k] = s[k] + n[k]. 

A matched filter is a linear filter, defined as 



N-l 

y[k] = ^ h[n]x[k 



n] =ys[k] +y„[k], 



where ys is the response of the filter to the sig- 
nal alone and i/n is the response of the signal to 
noise alone. The matched filter's coefficients max- 
imize the ratio of the expectation of the filter's 
instantaneous response to the variance in the fil- 
ter's output: 



(signal to noise) ^ 



ysm 



E[y[0]]^ 

var[y[fc]] va.r [y„[k]] 



It is well known (see, for example, Turin 1960) 
that if n[k] is Gaussian and wide-sense stationary, 
then the optimum is obtained when 



h[n] 



[n]S-'[ 



up to an arbitrary multiplicative constant. Here, 
h[n\, s[n\, and x[n\ are the discrete Fourier trans- 
forms (DFTs) of h[k], s[k], and x[k], respectively; 
S[n] = Fi [h[n]h* [n]] is the folded, two-sided, dis- 
crete power spectrum of n[k]. It is related to the 
continuous, one-sided power spectral density S{f) 
through 

(Sin) if n = or n = N/2 

S{nf/2N)/2 [iO<n<N/2 
S[N — n] otherwise, 

where N is the length of the filter and /" is the 
sample rate. (In order to satisfy the Nyquist- 
Shannon sampling criterion, it is assumed that the 
detector's continuous noise power spectral den- 
sity S{f) vanishes for all / > /°/2, or alterna- 
tively, that the data are low-pass filtered prior to 
matched filtering.) The DFT of the output is 

y[n] = s*[n] S^^\n\ x\n\ 

~S-^l\n\ S[n])* (s-^^\n\ £-[n]) . (3) 

The placement of parentheses in Equation (3) em- 
phasizes that the matched filter can be thought of 



as a cross-correlation of a whitened version of the 
data with a whitened version of the nominal signal. 
In this paper, we shall not describe the exact pro- 
cess by which the detector's noise power spectrum 
is estimated and deconvolved from the data; for 
the remainder of this paper we shall define a;[fc] 
as the whitened data stream. Correspondingly, 
from this point on we shall use h[k] to describe 
the whitened templates, being the inverse DFT of 

(^-i/2[n]s[n])*. 

Inspiral signals are continuously parameterized 
by a set of intrinsic source parameters 9 that de- 
termine the amplitude and phase evolution of the 
GW strain. For systems in which the effects of 
spin can be ignored, the intrinsic source parame- 
ters are just the component masses of the binary, 
9 = (mi, 7712). For a given source, the strain ob- 
served by the detector is a linear combination of 
two waveforms corresponding to the '-I-' and 'x' 
GW polarizations. Thus, we must design two fil- 
ters for each 9. 

The coefficients for the M filters are known as 
templates, and are formed by discretizing and time 
reversing the waveforms and weighting them by 
the inverse amplitude spectral density of the de- 
tector's noise. To construct a template bank, tem- 
plates are chosen with A//2 discrete signal param- 



eters 
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These are chosen such 



that any possible signal will have an inner prod- 
uct ^0.97 with at least one template. Such a 
template bank is said to have a minimal match 
of 0.97 (Owen & Sathyaprakash 1999). 

Filtering the detector data involves a convolu- 
tion of the data with the templates. For a unit- 
normalized template hi[k] and whitened detector 
data x[k], both sampled at a rate /°, the result 
can be interpreted as the S/N, Pi[k], defined as 



Af-l 



y hi[n] x[k — n]. 



(4) 



n=0 



This results in M S/N time series. Local peak- 
finding across time and template indices results in 
single-detector triggers. Coincidences are sought 
between triggers in different GW detectors in or- 
der to form detection candidates. 

Equation (4) can be implemented in the TD as 
a bank of finite impulse response (FIR) filters, re- 
quiring 0{MN) floating point operations per sam- 
ple. However, it is typically much more computa- 



tionally efficient to use the convolution theorem 
and the fast Fourier transform (FFT) to imple- 
ment fast convolution in the FD, requiring only 
0{MlgN) operations per sample but incurring a 
latency of 0{N) samples. 

3.2. The LLOID method 

Here we describe a method for reducing the 
computational cost of a TD search for CBC. 
We give a zero latency, real-time algorithm that 
competes in terms of floating point operations 
per second with the conventional overlap-save 
FD method, which by contrast requires a signifi- 
cant latency due to the inherent acausality of the 
Fourier transform. Our method, called LLOID 
(Low Latency Online Inspiral Detection) , involves 
two transformations of the templates that pro- 
duce a network of orthogonal filters that is far 
more computationally efficient than the original 
bank of matched filters. 

The ffist transformation is to chop the tem- 
plates into disjointly supported intervals, or time 
slices. Since the time slices of a given template 
are disjoint in time, they are orthogonal with re- 
spect to time. Given the chirp-like structure of 
the templates, the "early" (lowest frequency) time 
slices have significantly lower bandwidth and can 
be safely downsamplcd. Downsampling reduces 
the total number of filter coefficients by a factor 
of ~100 by treating the earliest part of the wave- 
form at ~1/100 of the full sample rate. Together, 
the factor of 100 reduction in the number of fil- 
ter coefficients and the factor of 100 reduction in 
the sample rate during the early inspiral save a 
factor of ~10^ floating point operations per sec- 
ond (flop s~^) over the original (full sample rate) 
templates. 

However, the resulting filters are still not or- 
thogonal across the parameter space and are in 
fact highly redundant. We use the SVD to ap- 
proximate the template bank by a set of orthog- 
onal basis filters (Caimon ct al. 2010). We find 
that this approximation reduces the number of fil- 
ters needed by another factor of ~' 100. These two 
transformations combined reduce the number of 
floating point operations to a level that is com- 
petitive with the conventional high-latency FD- 
matched filter approach. In the remainder of this 
section we describe the LLOID algorithm in detail 
and provide some basic computational cost scal- 



ing. 

3.2.1. Selectively reducing the sample rate of the 
data and templates 

The first step of our proposed method is to 
divide the templates into time slices in a TD 
analog to the FD decomposition employed by 
MBTA (Marion & the Virgo Collaboration 2003; 
Buskulic ct al. 2010). The application to GW 
data analysis is foreshadowed by an earlier FD 
convolution algorithm, proposed by Gardner 
(1995), based on splitting the impulse response 
of a filter into smaller blocks. We decompose each 
template hi[k] into a sum of S non-overlapping 
templates 
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S-1 
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s=0 
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otherwise 



(5) 



for S integers {ft''} such that = ft° < ft^ < 
■■■ < f°t^' = N. The outputs of these new 
time-sliced filters form an ensemble of partial S/N 
streams. By linearity of the ffltering process, these 
partial S/N streams can be summed to reproduce 
the S/N of the fuU template. 

Since waveforms with neighboring intrinsic 
source parameters have similar time-frequency 
evolution, it is possible to design computationally 
efficient time slices for an extended region of pa- 
rameter space rather than to design different time 
slices for each template. 

For concreteness and simplicity, consider an in- 
spiral waveform in the quadrupole approximation, 
for which the time-frequency relation is given by 
Equation (1). This monotonic time- frequency re- 
lationship allows us to choose time slice bound- 
aries that require substantially less bandwidth at 
early times in the inspiral. 

An inspiral signal will enter the detection band 
with some low frequency /low at time iiow before 
merger. Usually the template is truncated at some 
prescribed time t*^, or equivalent frequency /high, 
often chosen to correspond to the last stable orbit 
(LSO). The beginning of the template is critically 
sampled at 2/iow, but the end of the template is 
critically sampled at a rate of 2/high. In any time 
interval smaller than the duration of the template, 
the bandwidth of the filters across the entire tem- 
plate bank can be significantly less than the full 
sample rate at which data are acquired. 



Our goal is to reduce the filtering cost of 
a large fraction of the waveform by computing 
part of the convolution at a lower sample rate. 
Specifically we consider here time slice bound- 
aries with the smallest power-of-two sample rates 
that sub-critically sample the time-sliced tem- 
plates. The time slices consist of the S intervals 
[t°,t^), [i\i^), ..., [i'^-\i-^), sampled at fre- 
quencies /°, /^, ..., f^~^, where /^ is at least 
twice the highest nonzero frequency component of 
any filter in the bank for the sth time slice. 

The time-sliced templates can then be down- 
sampled in each interval without aliasing, so we 
define them as 



hm 




if f s; k/f' < t^+i 

otherwise. 



(6) 



We note that the time slice decomposition in 
Equation (5) is manifestly orthogonal since the 
time slices are disjoint in time. In the next sec- 
tion, we examine how to reduce the number of 
filters within each time slice via SVD of the time- 
sliced templates. 

3.2.2. Reducing the number of filters with the 
SVD 

As noted previously, the template banks used 
in inspiral searches are by design highly corre- 
lated. Cannon c^t al. (2()1()) showed that apply- 
ing the SVD to inspiral template banks greatly 
reduces the number of filters required to achieve 
a particular minimal match. A similar technique 
can be applied to the time-sliced templates as de- 
fined in Equation (6) above. The SVD is a matrix 
factorization that takes the form 



K[k] 



M-l 
1=0 



L''-l 



<i^?«?w«E^S'^''<w- (7) 



1=0 



where uf[k] are orthonormal basis templates re- 
lated to the original time-sliced templates through 
the reconstruction matrix, vfiCF^ . The expectation 
value of the fractional loss in S/N is the SVD tol- 
erance, given by 



L^-l 



E i^tf 



1=0 



E 

L 1=0 



i^t) 



determined by the number L'"^ of basis templates 
that are kept in the approximation. Cannon et al. 



(2010) showed that highly accurate approxima- 
tions of inspiral template banks could be achieved 
with few basis templates. We find that when com- 
bined with the time slice decomposition, the num- 
ber of basis templates L" is much smaller than the 
original number of templates Al and improves on 
the rank reduction demonstrated in Cannon et al. 
(2010) by nearly an order of magnitude. 

Because the sets of filters from each time slice 
form orthogonal subspaces, and the basis filters 
within a given time slice are mutually orthogo- 
nal, the set of all basis filters from all time slices 
forms an orthogonal basis spanning the original 
templates. 

In the next section, we describe how we form 
our early-warning detection statistic using the 
time slice decomposition and the SVD. 

3.2.3. Early-warning output 

In the previous two sections, we described two 
transformations that greatly reduce the computa- 
tional burden of TD filtering. We are now pre- 
pared to define our detection statistic, the early- 
warning output, and to comment on the compu- 
tational cost of evaluating it. 

First, the sample rate of the detector data must 
be decimated to match sample rates with each of 
the time slices. We will denote the decimated de- 
tector data streams using a superscript "s" to in- 
dicate the time slices to which they correspond. 
The operator H^ will represent the appropriate 
decimation filter that converts between the base 
sample rate f^ and the reduced sample rate /*: 

x'[k] = (i/^a;°)[fc]. 

We shall use the symbol H'^ to represent an inter- 
polation filter that converts between sample rates 
f'^^ and /^ of adjacent time slices. 



x'lk] = {H\ 



^s+l\ 



[k] 



From the combination of the time slice decom- 
position in Equation (6) and the SVD defined in 
Equation (7), we define the early-warning output 
accumulated up to time slice s using the recur- 



rence relation, 

S/N from previous time slices 



orthogonal FIR filters 



Af=-1 



1=0 n=0 



reconstruction 



(8) 



Observe that the early-warning output for time 
slice 0, p^[k], approximates the S/N of the origi- 
nal templates. The signal flow diagram in Figure 3 
illustrates this recursion relation as a multirate fil- 
ter network with a number of early-warning out- 
puts. 

Ultimately, the latency of the entire LLOID al- 
gorithm is set by the decimation and interpolation 
filters because they are generally time symmetric 
and slightly acausal. Fortunately, as long as the 
latency introduced by the decimation and inter- 
polation filters for any time slice s is less than 
that time slice's delay t", the total latency of the 
LLOID algorithm will be zero. To be concrete, 
suppose that the first time slice, sampled at a rate 
/o = 4096 Hz, spans times [t°, t^) = [0 s, 0.5 s), 
and the second time slice, sampled at /^ = 512 Hz, 
spans [t^, t^) = [0.5 s, 4.5 s). Then the second 
time slice's output, p}[k], will lead the first time 
slice's output, Pi[k], by 0.5 s. A decimation filter 
will be necessary to convert the 4096 Hz input sig- 
nal x[k] = x^[k] to the 512 Hz input a;^[fc], and an 
interpolation filter will be necessary to match the 
sample rates of the two early-warning outputs. In 
this example, as long as the decimation and inter- 
polation filters are together acausal by less than 
t^ ^ 0.5 s, the total S/N p\{k\ will be available 
with a latency of zero samples. When zero latency 
is important, we may take this as a requirement 
for the decimation and interpolation filter kernels. 

In the next section, we compute the expected 
computational cost scaling of this decomposition 
and compare it with the direct TD implementation 
of Equation (4) and higher latency blockwise FD 
methods. 

3.3. Comparison of computational costs 

We now examine the computational cost scal- 
ing of the conventional TD or FD matched filter 



procedure as compared with LLOID. For conve- 
nience. Table 2 provides a review of the notation 
that we will need in this section. 

3.3.1. Conventional TD method 

The conventional, direct TD method consists of 
a bank of FIR filters, or sliding-window dot prod- 
ucts. If there are M templates, each N samples 
in length, then each filter requires MN multipli- 
cations and additions per sample, or, at a sample 
rate /°, 

2MNf fiop s-^ (9) 

3.3.2. Conventional FD method 

The most common FD method is known as the 
overlap-save algorithm, described in Prc^ss ct al. 
(2007). It entails splitting the input into blocks 
of D samples, D > N , each block overlapping the 
previous one hy D ~ N samples. For each block, 
the algorithm computes the forward FFT of the 
data and each of the templates, multiplies them, 
and then computes the reverse FFT. 

Modern implementations of the FFT, such 
as the ubiquitous fftw, require about 2D\gD 
operations to evaluate a real transform of size 
D (Johnson & Frigo 2007). Including the for- 
ward transform of the data and M reverse trans- 
forms for each of the templates, the FFT costs 
2{M + l)DlgD operations per block. The mul- 
tiplication of the transforms adds a further 2MD 
operations per block. Since each block produces 
D — N usable samples of output, the overlap-save 



Table 2: Notation Used to Describe Filters. 



Definition 

/" Sample rate in time slice s 

M Number of templates 

A^ Number of samples per template 

S Number of time slices 

U Number of basis templates in time slice s 

N^ Number of samples in decimated time slice s 

N^ Length of decimation filter 

iV^ Length of interpolation filter 



method requires 

2{M+l)lgD + 2M 



f 



l-N/D 



flop s~^ 



(10) 



In the hmit of many templates, M ^ 1, we can 
neglect the cost of the forward transform of the 
data and of the multiphcation of the transforms. 
The computational cost will reach an optimum at 
some large but finite FFT block size D ^ N. 
In this limit, the FD method costs « 2f°M\gD 
flop s-i. 

By adjusting the FFT block size, it is possible 
to achieve low latency with FD convolution, but 
the computational cost grows rapidly as the la- 
tency in samples {D — N) decreases. It is easy to 
show that in the limit of many templates and long 
templates, Af, IgiV 3> 1, the computational cost 
scales as 



template length 
latency 



(2/"Mlg7V). 



3.3.3. LLOID method 



For time slice s, the LLOID method requires 
2N'^L^ f flop s^^ to evaluate the orthogonal fil- 
ters, 2ML^ f^ flop s~^ to apply the linear trans- 
formation from the L'^ basis templates to the M 
time-sliced templates, and M f^ flop s^^ to add 
the resultant partial S/N stream. 

The computational cost of the decimation of 
the detector data is a little bit more subtle. Dec- 
imation is achieved by applying an FIR anti- 
aliasing fllter and then downsampling, or delet- 
ing samples in order to reduce the sample rate 
from /•''"^ to /*. Naively, an anti-aliasing fll- 
ter with [f^""^ / f^)N^ coefficients should demand 
2N^{f'^~^Y / f^ flop s~^. However, it is neces- 
sary to evaluate the anti-aliasing filter only for 
the fraction f^/f''~^ of the samples that will not 
be deleted. Consequently, an efficient decima- 
tor requires only 2N^f^^^ flop s^^. (One com- 
mon realization is an ingenious structure called 
a polyphase decimator, described in Chapter 1 of 
Jovanovic-Dolecck (2002).) 

The story is similar for the interpolation fll- 
ters used to match the sample rates of the par- 
tial S/N streams. Interpolation of a data stream 
from a sample rate /" to f^~^ consists of in- 
serting zeros between the samples of the origi- 
nal stream, and then applying a low-pass fllter 



with {r~^/f')N^ coeflicicnts. The low-pass fil- 
ter requires 2MN'^ {f^'^Y / f^ flop s~^. However, 
by taking advantage of the fact that by construc- 
tion a fraction f^ / f^^^ of the samples are zero, 
it is possible to build an efficient interpolator 
that requires only 2MN^ f^^ flop s^^. (Again, 
see Jovanovic-Dolccek (2002) for a discussion of 
polyphase interpolation.) 

Taking into account the decimation of the de- 
tector data, the orthogonal FIR filters, the recon- 
struction of the time-sliced templates, the inter- 
polation of S/N from previous time slices, and the 
accumulation of S/N, in total the LLOID algo- 
rithm requires 



s-i 



Y^ {2N''L' + 2ML' + M) f 



s=0 

+ 2j2{N^f + MN^f'-^) (11) 
/^^e{/'':0<fc<s} 

fiop s^^. The second sum is carried out over the 
set of distinct sample rates (except for the base 
sample rate) rather than over the time slices them- 
selves, as we have found that it is sometimes desir- 
able to place multiple adjacent time slices at the 
same sample rate in order to keep the size of ma- 
trices that enter the SVD manageable. Here we 
have assumed that the decimation filters are con- 
nected in parallel, converting from the base sample 
rate /° to each of the time slice sample rates /^, 
/^ , . . . , and that the interpolation filters are con- 
nected in cascade fashion with each interpolation 
filter stepping from the sample rate of one time 
slice to the next. 

We can simplify this expression quite a bit by 
taking some limits that arise from sensible filter 
design. In the limit of many templates, the cost of 
the decimation filters is negligible as compared to 
the cost of the interpolation filters. Typically, we 
will design the interpolation filters with N'^ < L^ 
so that the interpolation cost itself is negligible 
compared with the reconstruction cost. Finally, if 
the number of basis templates per time slices L^ is 
not too small, the reconstruction cost dominates 
over the cost of accumulating the partial S/N. In 
these limits, the cost of LLOID is dominated by 
the basis filters themselves and the reconstruction, 
totaling 2 Yfs=o f'^' i^" + ^^) &«? s"^- 
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3.3.4- Speedup of LLOID relative to TD method 



4.1. Planning stage 



If the cost of the basis filters dominates, and the 
frequency of the templates evolves slowly enough 
in time, then we can use the time-frequency rela- 
tionship of Equation (1) to estimate the speedup 
relative to the conventional, direct TD method. 
The reduction in flop s~^ is approximately 



sEfJo^i^iv^ 



2MNp 



(iiow - ihigh) (/°) Jt 



*high 



{2f{t)f dt 



16a (tlowf (ilow) — thighf (ihigh)) 

(/°)' (iiow - thigh) 



(12) 



where a ~ L'"^ /M is the rank reduction factor, or 
ratio between the number of basis templates and 
the number of templates. This approximation as- 
sumes that the frequency of the signal is evolv- 
ing very slowly so that we can approximate the 
time slice sample rate as twice the instantaneous 
GW frequency, f^ « 2/(i), and the number of 
samples in the decimated time slice as the sample 
rate times an infinitesimally short time interval, 
A^''* ~ 2/(t) At. The integral is evaluated using the 
power-law form of f{t) from Equation (1). Sub- 
stituting approximate values for a template bank 
designed for component masses around (1.4, 1.4) 
Mq, a « 10"2, tiow = 103 s, /low - 10^ Hz, 
/high = /isco « 1570 Hz, f = 2/isco, and 
ihigh = /isco^ I we find from Equation (12) that 
the LLOID method requires only ^ 10~^ times as 
many flop s~^ as the conventional TD method. 

4. Implementation 

In this section we describe an implementation 
of the LLOID method described in Section 3 suit- 
able for rapid GW searches for CBCs. The LLOID 
method requires several computations that can be 
completed before the analysis is underway. Thus, 
we divide the procedure into an offline planning 
stage and an online, low-latency filtering stage. 
The offline stage can be done before the analysis 
is started and updated asynchronously, whereas 
the online stage must keep up with the detector 
output and produce search results as rapidly as 
possible. In the next two subsections we describe 
what these stages entail. 



The planning stage begins with choosing tem- 
plates that cover the space of source parameters 
with a hexagonal grid (Cokclaer 2007) in order 
to satisfy a minimal match criterion. This as- 
sures a prescribed maximum loss in S/N for signals 
whose parameters do not lie on the hexagonal grid. 
Next, the grid is partitioned into groups of neigh- 
bors called sub-banks that are appropriately sized 
so that each sub-bank can be efficiently handled 
by a single computer. Each sub-bank contains 
templates of comparable chirp mass, and there- 
fore similar time-frequency evolution. Dividing 
the source parameter space into smaller sub-banks 
also reduces the offline cost of the SVD and is 
the approach considered in Cannon et al. (2010). 
Next, we choose time slice boundaries as in Equa- 
tion (6) such that all of the templates within a sub- 
bank are sub-critically sampled at progressively 
lower sample rates. For each time slice, the tem- 
plates are downsampled to the appropriate sam- 
ple rate. Finally, the SVD is applied to each time 
slice in the sub-bank in order to produce a set of 
orthonormal basis templates and a reconstruction 
matrix that maps them back to the original tem- 
plates as described in Equation (7). The down- 
sampled basis templates, the reconstruction ma- 
trix, and the time slice boundaries are all saved to 
disk. 

4.2. Filtering stage 

The LLOID algorithm is amenable to latency- 
free, real-time implementation. However, a real- 
time search pipeline would require integration di- 
rectly into the data acquisition and storage sys- 
tems of the LIGO observatories. A slightly more 
modest goal is to leverage existing low latency, but 
not real-time, signal processing software in order 
to implement the LLOID algorithm. 

We have implemented a prototype of the low- 
latency filtering stage using an open-source signal 
processing environment called GStreamer'^ (ver- 
sion 0.10.33). GStreamer is a vital component 
of many Linux systems, providing media play- 
back, authoring, and streaming on devices from 
cell phones to desktop computers to streaming me- 
dia servers. Given the similarities of GW detec- 



^http : //gstreamer . net/ 



11 



tor data to audio data it is not surprising that 
GStrcamer is useful for our purpose. GStrcamcr 
also provides some useful stock signal processing 
elements such as resamplers and filters. We have 
extended the GStreamer framework by developing 
a library called gstlal'* that provides elements for 
GW data analysis. 

GStreamer pipelines typically operate with very 
low (in some consumer applications, imperceptibly 
low) latency rather than in true real time because 
signals are partitioned into blocks of samples, or 
buffers. This affords a number of advantages, in- 
cluding amortizing the overhead of passing signals 
between elements and grouping together sequences 
of similar operations. However, buffering a signal 
incurs a latency of up to one buffer length. This la- 
tency can be made small at the cost of some addi- 
tional overhead by making the buffers sufficiently 
small. In any case, buffering is a reasonable strat- 
egy for low-latency LIGO data analysis because, 
as we previously remarked, the LIGO data acqui- 
sition system has a granularity of 1/16 s. 

5. Results 

In this section we evaluate the accuracy of the 
LLOID algorithm using our GStreamer-based im- 
plementation described in the previous section. 
We calculate the measured S/N loss due to the ap- 
proximations of the LLOID method and our imple- 
mentation of it. Using a configuration that gives 
acceptable S/N loss for our chosen set of source 
parameters, we then compare the computational 
cost in flop s~^ for the direct TD method, the 
overlap-save FD method, and LLOID. 



TaylorF2 waveforms described in Buonanno et al. 
2009). Templates are truncated at 10 Hz, where 
the projected sensitivity of Advanced LIGO is in- 
terrupted by the "seismic wall." This results in a 
grid of 98,544 points, or 2 x 98, 544 = 197, 088 tem- 
plates. Then we create sub-banks by partitioning 
the parameter space by chirp mass. Figure 4 illus- 
trates this procedure. We concentrate on a sub- 
bank with 657 points with chirp masses between 
1.1955 and 1.2045 Mq, or 2 x 657 = 1314 tem- 
plates. With this sub-bank we are able to con- 




1.0 1.5 2.0 2.5 
chirp mass, M. (Mg) 



3.0 



Fig. 4. — Source parameters selected for sub-bank 
used in this case study, consisting of component 
masses mi and TO2, between 1 and 3 Mq, and 
chirp masses M between 1.1955 and 1.2045 Mq. 

struct an efficient time slice decomposition that 
consists of 11 time slices with sample rates be- 
tween 32 and 4096 Hz summarized in Table 3. We 
use this sub-bank and decomposition for the re- 
mainder of this section. 



5.1. Setup 

We examine the performance of the LLOID al- 
gorithm on a small region of compact binary pa- 
rameter space centered on typical NS-NS masses. 
We begin by constructing a template bank that 
spans component masses from 1 to 3 Mq using 
a simulated Advanced LIGO noise power spec- 
trum (Shoemaker 2010)^. Waveforms are gener- 
ated in the frequency domain in the stationary 
phase approximation at (post)'^'''-Ncwtonian order 
in phase and Newtonian order in amplitude (the 
loss to arise in our implementation of the LLOID 

*https://wwH.lsc-group.phys.u™.edu/daswg/projects/gstlal.htnglgQj.i^Jjj^_ ^j^g gj.g^ jg ^J^g g/^ loss duC tO the 
5http://dcc.ligo.org/cgi-bin/DocDB/ShowDocument?docid=T09002^3i{ygg|:j.jg=3of ^j^g gy^ .^^ j^s ^ ^^ ^^^^g^g ^^^_ 



5.2. Measured S/N loss 

The S/N loss is to be compared with the mis- 
match of 0.03 that arises from the discreteness of 
template bank designed for a minimal match of 
0.97. We will consider an acceptable target S/N 
loss to be a factor of 10 smaller than this, that is, 
no more than 0.003. 

We expect two main contributions to the S/N 
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Table 3: Filter Design Sub-Bank of 1314 Templates. 



(Hz) 


(s) 


N" 


— logiQ (1— SVD tolerance) 
12 3 4 5 6 



a; 



V 



3 32 Hz 
D 64 Hz 

1 128 Hz 



256 Hz 
512 Hz 
4096 Hz 



X 



JUZ 



JZ 



X 



K<3 






.<5 



^^"'b^^ 






time relative to coalescence (s) 



4096 [0,0.5) 2048 

512 [0.5,4.5) 2048 

256 [4.5,12.5) 2048 

128 [12.5,76.5) 8192 

64 [76.5,140.5) 4096 

64 [140.5,268.5) 8192 

64 [268.5,396.5) 8192 

32 [396.5,460.5) 2048 

32 [460.5,588.5) 4096 

32 [588.5,844.5) 8192 

32 [844.5,1100.5) 8192 



4 
6 
6 
20 
8 
7 
1 
1 
1 
1 
1 



25 
15 
21 
15 
3 
7 



10 
10 
28 
18 
25 
20 
9 
16 
26 
12 



10 
12 
12 
30 
20 
28 
23 
12 
18 
30 
20 



14 
16 
15 
32 
22 
30 
25 
14 
21 
33 
23 



Note. — From left to right, this table shows the sample rate, time interval, number of samples, and number of orthogonal 
templates for each time slice. We vary SVD tolerance from (l — lO^-') to (l — 10"*^). 



plates. As remarked upon in Cannon ct al. (2010) 
and Section 3.2.2, this effect is measured by the 
SVD tolerance. The second comes from the lim- 
ited bandwidth of the interpolation filters used to 
match the sample rates of the partial S/N streams. 
The maximum possible bandwidth is determined 
by the length of the filter, N^ . S/N loss could also 
arise if the combination of both the decimation 
filters and the interpolation filters reduces their 
bandwidth measurably, if the decimation and in- 
terpolation filters do not have perfectly uniform 
phase response, or if there is an unintended sub- 
sample time delay at any stage. 

To measure the accuracy of our GStreamer im- 
plemention of LLOID including all of the above 
potential sources of S/N loss, we conducted im- 
pulse response tests. The GStreamer pipeline was 
presented with an input consisting of a unit im- 
pulse. By recording the outputs, we can effec- 
tively "play back" the templates. These impulse 
responses will be similar, but not identical, to the 
original, nominal templates. By taking the inner 
product between the impulses responses for each 



output channel with the corresponding nominal 
template, we can gauge exactly how much S/N is 
lost due to the approximations in the LLOID algo- 
rithm and any of the technical imperfections men- 
tioned above. We call one minus this dot product 
the mismatch relative to the nominal template. 

The two adjustable parameters that affect per- 
formance and mismatch the most are the SVD 
tolerance and the length of the interpolation fil- 
ter. The length of the decimation filter affects 
mismatch as well, but has very little impact on 
performance. 

Effect of SVD tolerance We studied how the 
SVD tolerance affected S/N loss by holding iV^ = 
iV^ = 192 fixed as we varied the SVD tolerance 
from (l — 10~^) to (l — 10~^). The minimum, 
maximum, and median mismatch are shown as 
functions of SVD tolerance in Figure 5(a). As 
the SVD tolerance increases toward 1, the SVD 
becomes an exact matrix factorization, but the 
computational cost increases as the number of 
basis filters increases. The conditions presented 
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here are more complicated than in the original 
work (Cannon et al. 2010) due to the inclusion 
of the time-sliced templates and interpolation, 
though we still see that the average mismatch is 
approximately proportional to the SVD tolerance 
down to (l — 10~^) . However, as the SVD toler- 
ance becomes even higher, the median mismatch 
seems to saturate around 2 x 10~^. This could be 
the effect of the interpolation, or an unintended 
technical imperfection that we did not model or 
expect. However, this is still an order of magni- 
tude below our target mismatch of 0.003. Wc find 
that an SVD tolerance of (l — lO^**) is adequate 
to achieve our target S/N loss. 

Effect of interpolation filter length Next, 

keeping the SVD tolerance fixed at (l — 10~^) 
and the length of the decimation filter fixed at 
A^^ = 192, we studied the impact of the length 
A^^ of the interpolation filter on mismatch. We use 
GStreamer's stock audioresample element, which 
provides an FIR decimation filter with a Kaiser- 
windowed sine function kernel. The mismatch as 
a function of iV^ is shown in Figure 5(b). The 
mismatch saturates at ~2 x 10^* with N'^ = 64. 
We find that a filter length of 16 is sufficient to 
meet our target mismatch of 0.003. 

Having selected an SVD tolerance of (l — 10^**) 
and A'"'^ = 16, we found that we could reduce N^ to 
48 without exceeding a median mismatch of 0.003. 

We found that careful design of the decimation 
and interpolation stages made a crucial difference 
in terms of computational overhead. Connecting 
the interpolation filters in cascade fashion rather 
than in parallel resulted in a significant speedup. 
Also, only the shortest interpolation filters that 
met our maximum mismatch constraint resulted 
in a sub-dominant contribution to the overall cost. 
There is possibly further room for optimization 
beyond minimizing N'^ . We could design custom 
decimation and interpolation filters, or we could 
tune these filters separately for each time slice. 

5.3. Other potential sources of S/N loss 

One possible source of S/N loss for which we 
have not accounted is the leakage of sharp spec- 
tral features in the detector's noise spectrum due 
to the short durations of the time slices. In the 
LLOID algorithm, as with many other GW search 



methods, whitening is treated as an entirely sep- 
arate data conditioning stage. In this paper, we 
assume that the input to the filter bank is already 
whitened, having been passed through a filter that 
flattens and normalizes its spectrum. We elected 
to omit a detailed description of the whitening pro- 
cedure since the focus here is on the implementa- 
tion of a scalable inspiral filter bank. 

However, the inspiral templates themselves con- 
sist of the GW time series convolved with the im- 
pulse response of the whitening filter. As a con- 
sequence, the LLOID algorithm must faithfully 
replicate the effect of the whitening filter. Since 
in practice the noise spectra of ground-based GW 
detectors contain both high-Q lines at mechanical, 
electronic, and control resonances and a very sharp 
roUoff at the seismic wall, the frequency response 
of the LLOID filter bank must contain both high- 
er notches and a very abrupt high-pass filter. FIR 
filters with rapidly varying frequency responses 
tend to have long impulse responses and many co- 
efficients. Since the LLOID basis filters have, by 
design, short impulse responses and very few co- 
efficients, one might be concerned about spectral 
leakage contaminating the frequency response of 
the LLOID filter bank. 

The usual statement of the famous Nyquist- 
Shannon theorem, stated below as Theorem 1, has 
a natural dual. Theorem 2, that addresses the fre- 
quency resolution that can be achieved with an 
FIR filter of a given length. 

Theorem 1. (After Oppenheim, et al. 1997, p. 
518) Let x(t) be a band-limited signal with con- 
tinuous Fourier transform x{f) such that x{f') = 
V /' : I /'I > f]\i- Then, x{t) is uniquely de- 
termined by its discrete samples x{n/f^), n = 
0,±1,±2,...,z//">2/m. 

Theorem 2. Let x{t) be a compactly supported 
signal such that x(t') = V i' : \t'\ > Im- Then 
its continuous Fourier transform x{f) is uniquely 
determined by the discrete frequency components 
x{n Af), n = 0, ±1, ±2, . . . , if Af < l/{2tM). 

Another way of stating Theorem 2 is that, pro- 
vided x{t) is nonzero only for |t| < 1/(2 A/), 
the continuous Fourier transform can be recon- 
structed at any frequency / from a weighted sum 
of sine functions centered at each of the discrete 
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frequency components, namely, 

oo 

x(/)(x ^ £(nA/)sinc[7r(/-nA/)/A/]. 



n— — oo 



Failure to meet the conditions of this dual of the 
sampling theorem results in spectral leakage. For 
a TD signal to capture spectral features that are 
the size of the central lobe of the sine function, the 
signal must have a duration greater than 1/A/. If 
the signal x{t) is truncated by sampling it for a 
shorter duration, then its Fourier transform be- 
comes smeared out; conceptually, power "leaks" 
out into the side lobes of the sine functions and 
washes away sharp spectral features. In the GW 
data analysis literature, the synthesis of inspi- 
ral matched filters involves a step called inverse 
spectrum truncation (see Allen et al. 2011, Section 
VII) that fixes the number of coefficients based on 
the desired frequency resolution. 

In order to effectively flatten a line in the de- 
tector's noise power spectrum, the timescale of the 
templates must be at least as long as the damping 
time T of the line, r = 2Q/ujq, where Q is the qual- 
ity factor of the line and Wg is the central angular 
frequency. To put this into the context of the sam- 
pling theorem, in order to resolve a notch with a 
particular Q and /o, an FIR filter must achieve a 
frequency resolution of A/ > tt/q/Q and there- 
fore its impulse response must last for at least a 
time 1/A/ = Q/tt/q. For example, in the S6 de- 
tector configuration known as "Enhanced LIGO," 
the viohn modes (Pcnn ot al. 2007) had Q '^ 10^ 
and Wo ~ (27r)340 rad s^^. for a coherence time 
T ~ 102 s. 

In our example template bank, many of the 
time slices are much shorter than this. However, 
in summation the time slices have the same du- 
ration as the full templates themselves, and the 
full templates are much longer than many coher- 
ence times of the violin mode. For this reason, we 
speculate that LLOID should be just as robust 
to sharp line features as traditional FFT-based 
searches currently employed in the GW field. Fu- 
ture works must verify this reasonable supposition 
with numerical experiments, including impulse re- 
sponse studies similar to the ones presented here 
but with detector noise power spectra containing 
lines with realistically high quality factors. 

There could, in principle, be lines with coher- 



ence times many times longer than the template 
duration. For example, the Q of the violin modes 
may increase by orders of magnitude in Advanced 
LIGO (Strain & Cagnoli 2006). Also, there are 
certainly narrow lines that are non-stationary. 
Both of these cases can be dealt with by prepro- 
cessing h{t) with bandstop filters that attenuates 
the lines themselves but also conservatively large 
neighborhoods around them. If such bandstops 
were implemented as an FIR filter, they could be 
built into the time slices without any difficulty. 

Another way to deal with line features with 
coherence times much longer than the templates 
would be to entirely 'factor' the whitening out of 
the LLOID filter bank. Any line features could be 
notched out in the whitening stage with IIR filters, 
which can achieve infinitely high Q at just second 
order. If the detector data were passed through 
the whitening filter twice, then time-sliced filters 
need not depend on the detector's noise power 
spectral density at all. In such a variation on 
the LLOID method, the basis filters could be cal- 
culated from the weighted SVD (Gabriel Hz Zaniir 
1979; Jackson 2003, Chapter 3.6) of the time-sliced 
templates, using the covariance of the detector 
noise as a weight matrix. 

5.4. Lower bounds on computational cost 
and latency compared to other meth- 
ods 

We are now prepared to offer the estimated 
computational cost of filtering this sub-bank of 
templates compared to other methods. We used 
the results of the previous subsections to set the 
SVD tolerance to (l — 10^^), the interpolation fil- 
ter length to 16, and the decimation filter length 
to 48. Table 4 shows the computational cost in 
flop s~^ for the sub-bank we described above. For 
the overlap-save FD method, an FFT block size 
oi D — 2N is assumed, resulting in a latency of 
{N/f°) seconds. Both the FD method and LLOID 
are five orders of magnitude faster than the con- 
ventional, direct TD method. However, the FD 
method has a latency of over half of an hour, 
whereas the LLOID method, with suitable design 
of the decimation and interpolation filters, has no 
more latency than the direct TD method. 
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Table 4: Computational Cost in Flop s ^ and Latency in Seconds of the Direct TD Method, the Overlap-Save 
FD Method, and LLOID. 



Method 


Flop s-^ 
(Sub-bank) 


Latency (s) 


Flop s-^ 

(NS-NS) 


number of 
Machines 


Direct (TD) 
Overlap-save (FD) 
LLOID (theory) 
LLOID (prototype) 


4.9 X 10^-^ 
5.2 X 10® 
6.6 X 10** 
(0.9 cores) 




2x 103 



0.5 


3.8 X lO^'' 

5.9 X 10^" 
1.1 X 10" 


-3.8 X lO'' 

-5.9 

-11 





Note. — Cost is given for both the sub-bank described in Section 5.1 and a full 1-3 Mq NS-NS search. The last column gives 
the approximate number of machines per detector required for a full Advanced LIGO NS-NS search. 



5.5. Extrapolation of computational cost 
to an Advanced LIGO search 

Table 4 shows that the LLOID method re- 
quires 6.6 X lO'^ flop s~^ to cover a sub-bank com- 
prising 657 out of the total 98,544 mass pairs. 
Assuming that other regions of the parameter 
space have similar computational scaling, an en- 
tire single-detector search for NS-NS signals in the 
1-3 Mq component mass range could be imple- 
mented with (98, 544/657) « 150 times the cost, 
or 9.9 X 10^° flops"^ 

We computed the computational cost of a full 
Advanced LIGO NS-NS search a second way by 
dividing the entire 1-3 Mq parameter space into 
sub-banks of 657 points apiece, performing time 
slices and SVDs for each sub-bank, and tabulating 
the number of floating point operations using Ex- 
pression (11). This should be a much more accu- 
rate measure because template length varies over 
the parameter space. Lower chirp mass templates 
sweep through frequency more slowly and require 
more computations while higher chirp mass tem- 
plates are shorter and require fewer computations. 
Despite these subtleties, this estimate gave us 
1.1 X 10""^^ flop s~^ , agreeing with the simple scaling 
argument above. 

Modern (ca. 2011) workstations can achieve 
peak computation rates up to —10^^ flop s^^. In 
practice, we expect that a software implementa- 
tion of LLOID will reach average computation 
rates that are perhaps a factor 10 less than this, 
— 10^° flop s~^ per machine, due to non- floating 
point tasks including bookkeeping and thread syn- 



chronization. Given these considerations, we esti- 
mate that a full Advanced LIGO, single-detector, 
NS-NS search with LLOID in will require —10 ma- 
chines. 

By comparison, using the conventional TD 
method to achieve the same latency costs 4.9 x 
10^3 flop s~^ for this particular sub-bank, and so 
simply scaling up by the factor of 150 suggests 
that it would require 7.4 x 10^^ flop s~^ to search 
the full parameter space. To account for the vary- 
ing sample rate and template duration across the 
parameter space, we can also directly calculate the 
cost for the full TD method search using expres- 
sion (9), resulting in 3.8 x 10^^ flop s~^, agreeing 
within an order of magnitude. This would re- 
quire ^10^ current-day machines. Presently, the 
LIGO Data Grid^ consists of only —10^ machines, 
so direct TD convolution is clearly impractical. 

The overlap-save FD method is slightly more 
efficient than LLOID for this particular sub-bank, 
requiring 5.2 x 10** flop s^^. The scaling argu- 
ment projects that a full FD search would require 
7.8 X 10^" flop s^^. The direct calculation from 
Expression (10) gives 5.9 x 10^° flop s~^, in order- 
of-magnitude agreement. In this application, the 
conventional FD search is scarcely a factor of two 
faster than LLOID while gaining only 0.3% in 
S/N, but only at the price of thousands of seconds 
of latency. 



"https : //www . Isc-group . phys . uwm . edu/lscdatagr id/ 
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5.6. Measured latency and overhead 

Our GStreamer pipeline for measuring impulse 
responses contained instrumentation that would 
not be necessary for an actual search, including 
additional interpolation filters to bring the early- 
warning outputs back to the full sample rate and 
additional outputs for recording signals to disk. 

We wrote a second, stripped pipeline to eval- 
uate the actual latency and computational over- 
head. We executed this pipeline on one of the 
submit machines of the LIGO-Caltech cluster, a 
Sun Microsystems Sun Fire"^^ X4600 M2 server 
with eight quad-core 2.7 GHz AMD Opteron'^'^ 
8384 processors. This test consumed ^90% of the 
capacity of just one out of the 32 cores, maintain- 
ing a constant latency of ^0.5 s. 

The measured overhead is consistent to within 
an order of magnitude with the lower bound from 
the flop s~^ budget. Additional overhead is pos- 
sibly dominated by thread synchronization. A 
carefully optimized GStreamer pipeline or a hand- 
tuned C implementation of the pipeline might re- 
duce overhead further. 

The 0.5 s latency is probably due to buffering 
and synchronization. The latency might be re- 
duced by carefully tuning buffer lengths at every 
stage in the pipeline. Even without further re- 
finements, our implementation of the LLOID al- 
gorithm has achieved latencies comparable to the 
LIGO data acquisition system itself. 

6. Conclusions 

We have demonstrated a computationally feasi- 
ble filtering algorithm for the rapid and even early- 
warning detection of GWs emitted during the co- 
alescence of NSs and stellar-mass black holes. It is 
one part of a complicated analysis and observation 
strategy that will unfortunately have other sources 
of latency. However, we hope that it will motivate 
further work to reduce such technical sources of 
GW observation latency and encourage the possi- 
bility of even more rapid EM follow-up observa- 
tions to catch prompt emission in the advanced 
detector era. 

CBC events may be the progenitors of some 
short hard GRBs and are expected to be accom- 
panied by a broad spectrum of EM signals. Rapid 
alerts to the wider astronomical community will 



improve the chances of detecting an EM coun- 
terpart in bands from gamma-rays down to ra- 
dio. In the Advanced LIGO era, it appears possi- 
ble to usefully localize a few rare events prior to 
the GRB, allowing multi- wavelength observations 
of prompt emission. More frequently, low-latency 
alerts will be released after merger but may still 
yield extended X-ray tails and early on-axis after- 
glows. 

The LLOID method is as fast as conventional 
FFT-based, FD convolution but allows for latency 
free, real-time operation. We anticipate requiring 
>40 modern multi-core computers to search for bi- 
nary NSs using coincident GW data from a four- 
detector network. In the future, additional com- 
putational savings could be achieved by condition- 
ally reconstructing the S/N time series only during 
times when a composite detection statistic crosses 
a threshold (Cannon et al. 2011). However, the 
anticipated required number of computers is well 
within the current computing capabilities of the 
LIGO Data Grid. 

We have shown a prototype implementation of 
the LLOID algorithm using GStreamer, an open- 
source signal processing platform. Although our 
prototype already achieves latencies of less than 
one second, further fine tuning may reduce the 
latency even further. Ultimately the best possi- 
ble latency would be achieved by tighter integra- 
tion between data acquisition and analysis with 
dedicated hardware and software. This could be 
considered for third-generation detector design. 
Also possible for third-generation instruments, the 
LLOID method could provide the input for a dy- 
namic tuning of detector response via the signal 
recycling mirror to match the frequency of max- 
imum sensitivity to the instantaneous frequency 
of the GW waveform. This is a challenging tech- 
nique, but it has the potential for substantial gains 
in S/N and timing accuracy (Mcers et al. 1993). 

Although we have demonstrated a computa- 
tionally feasible statistic for advance detection, we 
have not yet explored data calibration and whiten- 
ing, triggering, coincidence, and ranking of GW 
candidates in a framework that supports early EM 
follow-up. One might explore these and also using 
the time slice decomposition and the SVD to form 
low-latency signal-based vetoes (e.g., x^ statistics) 
that have been essential for glitch rejection used 
in previous GW CBC searches. These additional 
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stages may incur some extra overhead, so comput- 
ing requirements will likely be somewhat higher 
than our estimates. 

Future work must more deeply address sky lo- 
calization accuracy in a realistic setting as well 
as observing strategies. Here, we have followed 
Fairhurst (2009) in estimating the area of 90% 
localization confidence in terms of timing uncer- 
tainties alone, but it would be advantageous to 
use a galaxy catalog to inform the telescope tiling 
(Nuttall fc Sutton 2Ufl)). Because early detec- 
tions will arise from nearby sources, the galaxy 
catalog technique might be an important ingredi- 
ent in reducing the fraction of sky that must be 
imaged. Extensive simulation campaigns incorpo- 
rating realistic binary merger rates and detector 
networks will be necessary in order to fully under- 
stand the prospects for early-warning detection, 
localization, and EM follow-up using the tech- 
niques we have described. 
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Decimation Orthogonal Reconstruction Interpolation and 
of input Fjpj^ filters matrices SNR accumulation 



X o >• 




>o Pa 

>opr 

X3 P2 



■K) Pm-1 



Fig. 3. — Schematic of LLOID pipeline illustrating signal flow. Circles with arrows represent interpolation 
© or decimation ©. Circles with plus signs represent summing junctions ©. Squares D stand for FIR 
filters. Sample rate decreases from the top of the diagram to the bottom. In this diagram, each time slice 
contains three FIR filters that are linearly combined to produce four output channels. In a typical pipeline, 
the number of FIR filters is much less than the number of output channels. 
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(a) Mismatch versus SVD tolerance 
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Fig. 5. — Box-and-whisker plot of mismatch between nominal template bank and LLOID measured impulse 
responses. The upper and lower boundaries of the boxes show the upper and lower quartiles; the lines 
in the center denote the medians. The whiskers represent the minimum and maximum mismatch over all 
templates. In (a) the interpolation filter length is held fixed at N^ = 192, while the SVD tolerance is varied 
from (1 - 10^1) to (1 - 10"^). In (b), the SVD tolerance is fixed at (l - 10"*^) while iV^ is varied from 8 
to 192. 
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A. Low frequency cutoff for inspiral searches 

Ground-based GW detectors are unavoidably affected at low frequencies by seismic and anthro- 
pogenic ground motion. The LIGO test masses are suspended from multiple-stage pcndula, which 
attenuate ground motion down to the pole frequency. In the detector configuration in place dur- 
ing S6, seismic noise dominated the instrumental background below about 40 Hz. Considerable effort 
is being invested in improving seismic attenuation in Advanced LIGO using active and passive isola- 
tion (Harry & the LIGO Scientific Collaboration 2010), so that suspension thermal noise will dominate 
down to 10-15 Hz. Inspiral waveforms are chirps of infinite duration, but since an interferometric detector's 
noise diverges at this so-called "seismic wall," templates for matched filter searches are truncated at a 
low-frequency cutoff /low in order to save computational overhead with negligible loss of SNR. 

The expected matched-filter SNR, integrated from /jow to /high, is given by Equation (2). The high- 
frequency cutoff for the inspiral is frequently taken to be the GW frequency at the LSO; for non- 
spinning systems, /lso = 44OO(Af0/M) Hz, where M is the total mass of the binary (section 3.4.1 of 
Sathyaprakash & Schutz 2009). The choice of /low is based on the fraction of the total SNR that is accumu- 
lated at frequencies above /low To illustrate the relative contributions to the SNR at different frequencies 
for a (1.4, 1.4) M© binary, we normalized and plotted the integrand of Equation (2), the noise-weighted 
power spectral density of the inspiral waveform, in Figure 6(b). This is the quantity 




which is normalized by the total SNR squared in order to put detectors with different absolute sensitivities 
on the same footing. We used several different noise power spectra: all of the envisioned Advanced LIGO 
configurations from Shoemaker (2010); the best-achieved sensitivity at LIGO Hanford Observatory (LHO) in 
LIGO's fifth science run (S5), measured by Abadic ct al. (2010a); and the best-achieved sensitivity at LHO 
during S6, measured by Abadic et al. (2011b). (The noise spectra themselves are shown in Figure 6(a).) It 
is plain that during S5 and S6 the greatest contribution to the S/N was between 100 and 150 Hz, but for 
all of the proposed Advanced LIGO configurations the bulk of the S/N is accumulated below 60 Hz. This 
information is presented in a complementary way in Figure 6(c), as the square root of the cumulative integral 
from /low to /lso, interpreted as a fraction of the total "available" S/N, 



Pfrac(/low) 



/■/lso f-7/Z 




Table 5 shows the fractional accumulated S/N for four selected low-frequency cutoffs, 40 Hz, 30 Hz, 20 Hz, 
and 10 Hz. In S5 and S6, all of the S/N is accumulated above 40 Hz. For the 'high frequency' Advanced 
LIGO configuration, scarcely half of the S/N is accumulated above 40 Hz. For the preferred final configu- 
ration, 'zero detuning, high power,' 86.1% of the S/N is above 40 Hz, 93.2% is above 30 Hz, and 98.1% is 
above 20 Hz. (Since S/N accumulates in quadrature, this means, on the other hand, that under the 'high 
frequency' configuration a template encompassing jitsi the early inspiral from 10 to 40 Hz would accumulate 
y/l — 0.533^ ~ 84.6% of the total S/N! In the 'zero detuning, high power,' configuration, integration from 10 
to 40 Hz alone would yield 50.9% of the total S/N, from 10 to 30 Hz, 36.2%, and from 10 to 20 Hz, 19.4%.) 
Since the GW amplitude is inversely proportional to the luminosity distance of the source, and the 
sensitive volume is proportional to distance cubed, the rate of detectable coalescences depends on the choice 
of low-frequency cutoff. An inspiral search that is designed with a low-frequency cutoff at the seismic wall 
would gain an increase in detection rate of /3^aj,(/iow) relative to a search with a low-frequency cutoff of 
/low This would represent almost a twofold increase in the rate of detection over a search with a fractional 
accumulated S/N of 80%, and still a 37% increase over a search with pfrac = 90%. Existing coalescing binary 
detection pipelines strive to sacrifice no more than 3% of the available S/N; this forfeits less than a 10% gain 
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Table 5: Fractional Accumulated S/N /9tiac(/iow) for Four Selected Low Frequency Cutoffs, /low — 40 Hz, 
30 Hz, 20 Hz, and 10 Hz. 



Noise model 40 Hz 30 Hz 20 Hz 10 Hz 



LHO (best S5) 

LHO (best S6) 

High frequency 

NoSRM 

BHBH 20° 

NSNS optimized 

Zero detuning, low power 

Zero detuning, high power 



100.0 


100.0 


100.0 


100.0 


100.0 


100.0 


100.0 


100.0 


53.3 


80.1 


97.6 


100.0 


87.8 


95.1 


98.7 


100.0 


71.1 


84.2 


96.2 


100.0 


91.5 


96.3 


99.0 


100.0 


67.9 


80.0 


93.5 


100.0 


86.1 


93.2 


98.1 


100.0 



in detection rate. In order to satisfy this constraint, the low-frequency cutoff would have to be placed below 
30 Hz for all of the conceived Advanced LIGO configurations. 

The instantaneous GW frequency, given by Equation (1), is a power law function of time, so the amount 
of time for the GW frequency to evolve from /low to /lso depends strongly on /low The duration of a 
(1.4, 1.4) Mq inspiral is show in Figure 6(d). The inspiral takes only 25 s to evolve from 40 Hz to /lsOj but 
takes 54 s to evolve from 30 Hz to /lsOi 158 s from 20 Hz, and 1002 s from 10 Hz. 
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(d) Inspiral duration vs. /j 
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Fig. 6. — From top left: (a) noise amplitude spectral density for a variety of Advanced LIGO noise models, 
S5, and S6. (b) Normalized signal-to-noise per unit frequency, (dp^/d/)/p^, for a (1.4, 1.4) Mq inspiral. 
(c) Percentage of S/N that is accumulated from /low to /lso, relative to S/N accumulated from /low = GHz 
to /lso ■ (d) Amount of time for a NS-NS inspiral signal to evolve from frequency /low to /lso , as a function 
of /low For (a)-(c), the line style indicates which noise model was used. 
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